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Abstract 

We present two methods for unsupervised 
segmentation of words into morpheme- 
like units. The model utilized is espe- 
cially suited for languages with a rich 
morphology, such as Finnish. The first 
method is based on the Minimum Descrip- 
tion Length (MDL) principle and works 
online. In the second method, Max- 
imum Likelihood (ML) optimization is 
used. The quality of the segmentations is 
measured using an evaluation method that 
compares the segmentations produced to 
an existing morphological analysis. Ex- 
periments on both Finnish and English 
corpora show that the presented methods 
perform well compared to a current state- 
of-the-art system. 

1 Introduction 

According to linguistic theory, morphemes are con- 
sidered to be the smallest meaning-bearing ele- 
ments of language, and they can be defined in a 
language-independent manner However, no ade- 
quate language-independent definition of the word 
as a unit has been agreed upon (Karlsson, 1998, 
p. 83). If effective methods can be devised for the 
unsupervised discovery of morphemes, they could 
aid the formulation of a linguistic theory of mor- 
phology for a new language. 

It seems that even approximative automated mor- 
phological analysis would be beneficial for many 



natural language applications dealing with large vo- 
cabularies. For example, in text retrieval it is cus- 
tomary to preprocess texts by returning words to 
their base forms, especially for morphologically rich 
languages. 

Moreover, in large vocabulary speech recognition, 
predictive models of language are typically used for 
selecting the most plausible words suggested by an 
acoustic speech recognizer (see, e.g., Bellegarda, 
2000). Consider, for example the estimation of the 
standard n-gram model, which entails the estima- 
tion of the probabilities of all sequences of n words. 
When the vocabulary is very large, say 100 000 
words, the basic problems in the estimation of the 
language model are: (1) If words are used as ba- 
sic representational units in the language model, the 
number of basic units is very high and the estimated 
word n-grams are poor due to sparse data. (2) Due 
to the high number of possible word forms, many 
perfectly valid word forms will not be observed at 
all in the training data, even in large amounts of text. 
These problems are particularly severe for languages 
with rich morphology, such as Finnish and Turkish. 
For example, in Finnish, a single verb may appear in 



thousands of different forms ( [Karlsson, 1987[ ). 

The utilization of morphemes as basic representa- 
tional units in a statistical language model instead of 
words seems a promising course. Even a rough mor- 
phological segmentation could then be sufficient. 
On the other hand, the construction of a comprehen- 
sive morphological analyzer for a language based 
on linguistic theory requires a considerable amount 
of work by experts. This is both slow and expen- 
sive and therefore not applicable to all languages. 



Table 1 : The morphological structure of the Finnish 
word for 'also for [the] coffee drinker'. 



Word 


kah vinj uoj allekin 


Morphs 


kahvi 


n 


juo 


ja 


He 


kin 


Transl. 


coffee 


of 


drink 


-er 


for 


also 



The problem is further compounded as languages 
evolve, new words appear and grammatical changes 
take place. Consequently, it is important to develop 
methods that are able to discover a morphology for 
a language based on unsupervised analysis of large 
amounts of data. 

As the morphology discovery from untagged cor- 
pora is a computationally hard problem, in practice 
one must make some assumptions about the struc- 
ture of words. The appropriate specific assumptions 
are somewhat language-dedependent. For example, 
for English it may be useful to assume that words 
consist of a stem, often followed by a suffix and pos- 
sibly preceded by a prefix. By contrast, a Finnish 
word typically consists of a stem followed by multi- 
ple suffixes. In addition, compound words are com- 
mon, containing an alternation of stems and suf- 
fixes, e.g., the word kahvin juo jallekin(Engl. 
'also for [the] coffee drinker' ; cf. Table More- 
over, one may ask, whether a morphologically com- 
plex word exhibits some hierarchical structure, or 
whether it is merely a flat concatenation of stems 
and suffices. 

1.1 Previous Work on Unsupervised 
Segmentation 

Many existing morphology discovery algorithms 
concentrate on identifying prefixes, suffixes and 
stems, i.e., assume a rather simple inflectional mor- 
phology. 

Dejean (1998) concentrates on the problem of 
finding the list of frequent affixes for a language 
rather than attempting to produce a morphological 
analysis of each word. Following the work of Zellig 
Harris he identifies possible morpheme boundaries 
by looking at the number of possible letters follow- 
ing a given sequence of letters, and then utilizes fre- 
quency limits for accepting morphemes. 



Goldsmith (2000) concentrates on stem-i-suffix- 
languages, in particular Indo-European languages, 
and tries to produce output that would match as 
closely as possible with the analysis given by a hu- 
man morphologist. He further assumes that stems 
form groups that he calls signatures, and each sig- 
nature shares a set of possible affixes. He applies an 
MDL criterion for model optimization. 

The previously discussed approaches consider 
only individual words without regard to their con- 
texts, or to their semantic content. In a different ap- 
proach, Schone and Jurafsky (2000) utilize the con- 
text of each term to obtain a semantic representa- 
tion for it using LSA. The division to morphemes 
is then accepted only when the stem and stem-i-affix 
are sufficiently similar semantically. Their method 
is shown to improve on the performance of Gold- 
smith's Linguistica on CELEX, a morphologically 
analyzed English corpus. 

In the related field of text segmentation, one can 
sometimes obtain morphemes. Some of the ap- 
proaches remove spaces from text and try to identify 
word boundaries utilizing e.g. entropy-based mea- 



sures, as in ( [Redlich, 1993| ). 

Word induction from natural language text with- 
out word bound aries is also studied in ( Deligne 
and Bimbot, 1997; |Hua, 2000D , whereMDL-based 
model optimization measures are used. Viterbi or 
the forward-backward algorithm (an EM algorithm) 
is used for improving the segmentation of the cor- 
pus0. 

Also de Marcken (1995; 1996) studies the prob- 
lem of learning a lexicon, but instead of optimiz- 
ing the cost of the whole corpus, as in (Redlich, 
1993; iHua, 2000D , de Marcken starts with sentences. 
Spaces are included as any other characters. 

U tterances are also analyzed in (K it and Wilks, 
1999) where optimal segmentation for an utterance 
is sought so that the compression effect over the seg- 
ments is maximal. The compression effect is mea- 
sured in what the authors call Description Length 
Gain, defined as the relative reduction in entropy. 
The Viterbi algorithm is used for searching for the 
optimal segmentation given a model. The input ut- 



'For a comp rehensive view of Finnish morphology, see 



(Karlsson, 198 



I" 



The regular EM procedure only maximizes the likelihood 
of the data. To follow the MDL approach where model cost is 
also optimized, Hua includes the model cost as a penalty term 
on pure ML probabilities. 



terances include spaces and punctuation as ordinary 
characters. The method is evaluated in terms of pre- 
cision and recall on word boundary prediction. 

Brent presents a general, modular probabilistic 
model structure for word discovery ( Brent, 1999| ). 
He uses a minimum representation length criterion 
for model optimization and applies an incremental, 
greedy search algorithm which is suitable for on-line 
learning such that children might employ. 

1.2 Our Approach 

In this work, we use a model where words may con- 
sist of lengthy sequences of segments. This model 
is especially suitable for languages with agglutina- 
tive morphological structure. We call the segments 
morphs and at this point no distinction is made be- 
tween stems and affixes. 

The practical purpose of the segmentation is 
to provide a vocabulary of language units that is 
smaller and generalizes better than a vocabulary 
consisting of words as they appear in text. Such a 
vocabulary could be utilized in statistical language 
modeling, e.g., for speech recognition. Moreover, 
one could assume that such a discovered morph vo- 
cabulary would correspond rather closely to linguis- 
tic morphemes of the language. 

We examine two methods for unsupervised learn- 
ing of the model, presented in Sections ^ and|3[ The 
cost function for the first method is derived from the 
Minimum Description Length principle from classic 
information theory ( Rissanen, 1989] ), which simul- 
taneously measures the goodness of the representa- 
tion and the model complexity. Including a model 
complexity term generally improves generalization 
by inhibiting overlearning, a problem especially se- 
vere for sparse data. An incremental (online) search 
algorithm is utilized that applies a hierarchical split- 
ting strategy for words. In the second method the 
cost function is defined as the maximum likelihood 
of the data given the model. Sequential splitting is 
applied and a batch learning algorithm is utilized. 

In Section ^ we develop a method for evaluat- 
ing the quality of the morph segmentations produced 
by the unsupervised segmentation methods. Even 
though the morph segmentations obtained are not in- 
tended to correspond exactly to the morphemes of 
linguistic theory, a basis for comparison is provided 
by existing, linguistically motivated morphological 



analyses of the words. 

Both segmentation methods are applied to the 
segmentation of both Finnish and English words. 
In Section |5|, we compare the results obtained from 
our methods to results produced by Goldsmith's Lin- 
guistica on the same data. 

2 Method 1: Recursive Segmentation and 
MDL Cost 

The task is to find the optimal segmentation of the 
source text into morphs. One can think of this as 
constructing a model of the data in which the model 
consists of a vocabulary of morphs, i.e. the code- 
book and the data is the sequence of text. We try to 
find a set of morphs that is concise, and moreover 
gives a concise representation for the data. This is 
achieved by utilizing an MDL cost function. 

2. 1 Model Cost Using MDL 

The total cost consists of two parts: the cost of the 
source text in this model and the cost of the code- 
book. Let M be the morph codebook (the vocab- 
ulary of morph types) and D = mim2 ■ ■ ■ rUn the 
sequence of morph tokens that makes up the string 
of words. We then define the total cost C as 



C 



Cost(Source text) + Cost (Codebook) 

- logp(mi) + ^ /c * l{mj) 



tokens 



types 



The cost of the source text is thus the negative log- 
likelihood of the morph, summed over all the morph 
tokens that comprise the source text. The cost of the 
codebook is simply the length in bits needed to rep- 
resent each morph separately as a string of charac- 
ters, summed over the morphs in the codebook. The 
length in characters of the morph rrij is denoted by 
I {rrij ) and k is the number of bits needed to code a 
character (we have used a value of 5 since that is suf- 
ficient for coding 32 lower-case letters). For p{mi) 
we use the ML estimate, i.e., the token count of nii 
divided by the total count of morph tokens. 

2.2 Search Algorithm 

The online search algorithm works by incremen- 
tally suggesting changes that could improve the cost 
function. Each time a new word token is read 
from the input, different ways of segmenting it into 



morphs are evaluated, and the one with minimum 
cost is selected. 

Recursive segmentation. The search for the opti- 
mal morph segmentation proceeds recursively. First, 
the word as a whole is considered to be a morph and 
added to the codebook. Next, every possible split of 
the word into two parts is evaluated. 

The algorithm selects the split (or no split) that 
yields the minimum total cost. In case of no spUt, 
the processing of the word is finished and the next 
word is read from input. Otherwise, the search for a 
split is performed recursively on the two segments. 
The order of splits can be represented as a binary tree 
for each word, where the leafs represent the morphs 
making up the word, and the tree structure describes 
the ordering of the splits. 

During model search, an overall hierarchical data 
structure is used for keeping track of the current 
segmentation of every word type encountered so 
far. Let us assume that we have seen seven in- 
stances of linja-auton (Engl, 'of [the] bus') 
and two instances of autonkul jetta jalla- 
kaan (Engl, 'not even by/at/with [the] car driver'). 
Figure |I| then shows a possible structure used for 
representing the segmentations of the data. Each 
chunk is provided with an occurrence count of the 
chunk in the data set and the split location in this 
chunk. A zero split location denotes a leaf node, i.e., 
a morph. The occurrence counts flow down through 
the hierachical structure, so that the count of a child 
always equals the sum of the counts of its parents. 
The occurrence counts of the leaf nodes are used for 
computing the relative frequencies of the morphs. 
To find out the morph sequence that a word consists 
of, we look up the chunk that is identical to the word, 
and trace the split indices recursively until we reach 
the leafs, which are the morphs. 

Note that the hierarchical structure is used only 
during model search: It is not part of the final model, 
and accordingly no cost is associated with any other 
nodes than the leaf nodes. 

Adding and removing morphs. Adding new 
morphs to the codebook increases the codebook 
cost. Consequently, a new word token will tend to 
be split into morphs already listed in the codebook, 
which may lead to local optima. To better escape lo- 
cal optima, each time a new word token is encoun- 
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linja-auton 



autonkulj ettaj allakaan 
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n 




kulj ettaj alia kaan 




Figure 1: Hierarchical structure of the seg- 
mentation of the words linja-auton and 
autonkul jetta jallakaan. The boxes repre- 
sent chunks. Boxes with bold text are morphs, and 
are part of the codebook. The numbers above each 
box are the split location (to the left of the colon 
sign) and the occurrence count of the chunk (to the 
right of the colon sign). 



tered, it is resegmented, whether or not this word has 
been observed before. If the word has been observed 
(i.e. the corresponding chunk is found in the hierar- 
chical structure), we first remove the chunk and de- 
crease the counts of all its children. Chunks with 
zero count are removed (remember that removal of 
leaf nodes corresponds to removal of morphs from 
the codebook). Next, we increase the count of the 
observed word chunk by one and re-insert it as an 
unsplit chunk. Finally, we apply the recursive split- 
ting to the chunk, which may lead to a new, different 
segmentation of the word. 

"Dreaming". Due to the online learning, as the 
number of processed words increases, the quality 
of the set of morphs in the codebook gradually im- 
proves. Consequently, words encountered in the be- 
ginning of the input data, and not observed since, 
may have a sub-optimal segmentation in the new 
model, since at some point more suitable morphs 
have emerged in the codebook. We have therefore 
introduced a 'dreaming' stage: At regular intervals 
the system stops reading words from the input, and 
instead iterates over the words already encountered 
in random order. These words are resegmented and 
thus compressed further, if possible. Dreaming con- 



tinues for a limited time or until no considerable de- 
crease in the total cost can be observed. Figure |2| 
shows the development of the average cost per word 
as a function of the increasing amount of source text. 



Corpus: Finnish newspaper text 100 000 words 




20 40 60 80 100 



Number of words read [x 1000 words] 

Figure 2: Development of the average word cost 
when processing newspaper text. Dreaming, i.e., the 
re-processing of the words encountered so far, takes 
place five times, which can be seen as sudden drops 
on the curve. 

3 Method 2: Sequential Segmentation and 
ML Cost 

3. 1 Model Cost Using ML 

In this case, we use as cost function the likelihood 
of the data, i.e., P{data\model). Thus, the model 
cost is not included. This corresponds to Maximum- 
Likelihood (ML) learning. The cost is then 

Cost(Source text) = — logp(mj), (1) 

morph tokens 

where the summation is over all morph tokens in the 
source data. As before, for p{mi) we use the ML 
estimate, i.e., the token count of rrii divided by the 
total count of morph tokens. 

3.2 Search Algorithm 

In this case, we utilize batch learning where an EM- 
like (Expectation-Maximization) algorithm is used 
for optimizing the model. Moreover, splitting is not 
recursive but proceeds linearly. 



1. Initialize segmentation by splitting words into 
morphs at random intervals, starting from the 
beginning of the word. The lengths of intervals 
are sampled from the Poisson distribution with 
A = 5.5. If the interval is larger than the num- 
ber of letters in the remaining word segment, 
the splitting ends. 

2. Repeat for a number of iterations: 

(a) Estimate morph probabilities for the given 
splitting. 

(b) Given the current set of morphs and their 
probabilities, re-segment the text using the 
Viterbi algorithm for finding the segmen- 
tation with lowest cost for each word. 

(c) If not the last iteration: Evaluate the seg- 
mentation of a word against rejection cri- 
teria. If the proposed segmentation is not 
accepted, segment this word randomly (as 
in the Initialization step). 

Note that the possibility of introducing a random 
segmentation at step (c) is the only thing that allows 
for the addition of new morphs. (In the cost function 
their cost would be infinite, due to ML probability 
estimates). In fact, without this step the algorithm 
seems to get seriously stuck in suboptimal solutions. 

Rejection criteria. (1) Rare moiphs. Reject the 
segmentation of a word if the segmentation contains 
a morph that was used in only one word type in the 
previous iteration. This is motivated by the fact that 
extremely rare morphs are often incorrect. (2) Se- 
quences of one-letter morphs. Reject the segmenta- 
tion if it contains two or more one-letter morphs in 
a sequence. For instance, accept the segmentation 
halua + n (Engl. 'Iwanf,i.e.. present stem of 
the verb 'to want' followed by the ending for the first 
person singular), but reject the segmentation halu 
+ a + n (stem of the noun 'desire' followed by 
a strange sequence of endings). Long sequences of 
one-letter morphs are usually a sign of a very bad 
local optimum that may even get worse in future it- 
erations, in case too much probability mass is trans- 
ferred onto these short morphs]^ 

^Nevertheless, for Finnish there do exist some one-letter 
morphemes that can occur in a sequence. However, these mor- 
phemes can be thought of as a group that belongs together: e.g., 



4 Evaluation Measures 

We wish to evaluate the method quantitatively from 
the following perspectives: (1) correspondence with 
linguistic morphemes, (2) efficiency of compression 
of the data, and (3) computational efficiency. The ef- 
ficiency of compression can be evaluated as the total 
description length of the corpus and the codebook 
(the MDL cost function). The computational effi- 
ciency of the algorithm can be estimated from the 
running time and memory consumption of the pro- 
gram. However, the linguistic evaluation is in gen- 
eral not so straightforward. 

4.1 Linguistic Evaluation Procedure 

If a corpus with marked morpheme boundaries is 
available, the linguistic evaluation can be computed 
as the precision and recall of the segmentation. Un- 
fortunately, we did not have such data sets at our dis- 
posal, and for Finnish such do not even exist. In ad- 
dition, it is not always clear exactly where the mor- 
pheme boundary should be placed. Several alterna- 
tives may be possible, cf. Engl, hope + dvs. hop 
+ ed, (past tense of to hope). 

Instead, we utilized an existing tool for providing 
a morphological analysis, although not a segmenta- 
tion, of words, based on the two-level morphology 
of Koskenniemi (1983). The analyzer is a finite-state 
transducer that reads a word form as input and out- 
puts the base form of the word together with gram- 
matical tags. Sample analyses are shown in Figure ^. 

The tag set consists of tags corresponding to 
morphological affixes and other tags, for example, 
part-of-speech tags. We preprocessed the analyses 
by removing other tags than those corresponding 
to affixes, and further split compound base forms 
(marked using the # character by the analyzer) into 
their constituents. As a result, we obtained for each 
word a sequence of labels that corresponds well to 
a linguistic morphemic analysis of the word. A la- 
bel can often be considered to correspond to a single 
word segment, and the labels appear in the order of 
the segments. 

The following step consists in retrieving the seg- 
mentation produced by one of the unsupervised seg- 
mentation algorithms, and trying to align this seg- 
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Output 


Word 


Base form 


Tags 


easily 


EASY 


<DER:ly> ADV 


bigger 


BIG 


A CMP 


hours' 


HOUR 


N PL GEN 


auton 


AUTO 


N SG GEN 


puutaloja 


PUU#TALO 


N PL PTV 


tehnyt 


TEHDA 


V ACT PCP2 SG 



Figure 3: Morphological analyses for some En- 
glish and Finnish word forms. The Finnish words 
are auton {car's), puutaloja {[some] wooden 
houses) and tehnyt {[has] done). The tags are A 
(adjective), ACT (active voice), ADV (adverb), CMP 
(comparative), GEN (genitive), N (noun), PCP2 (2nd 
participle), PL (plural), PTV (partitive), SG (singu- 
lar), V (verb), and <DER : ly> (-ly derivative). 

mentation with the desired morphemic label se- 
quence (cf. Figure 

A good segmentation algorithm will produce 
morphs that align gracefully with the correct mor- 
phemic labels, preferably producing a one-to-one 
mapping. A one-to-many mapping from morphs 
to labels is also acceptable, when a morph forms a 
common entity, such as the suffix -ja in puutaloja, 
which contains both the plural and partitive element. 
By contrast, a many-to-one mapping from morphs 
to a label is a sign of excessive splitting, e.g., t + 
alo for talo (cf. English h + ouse for house). 
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PUU 


TALO 
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PTV 


Morph sequence 


puu 


t alo 


ja 



the Finnish talo + j + a (plural partitive of 'house'); can 
also be thought of as talo + j a. 



Figure 4: Alignment of obtained morph sequences 
with their respective correct morphemic analyses. 
We assume that the segmentation algorithm has 
split the word W^^er into the morphs bigg + er, 
hours' into hour + s + ' and puutaloja into 
puu + t + alo + ja. 

Alignment procedure. We align the morph se- 
quence with the morphemic label sequence using 



dynamic programming, namely Viterbi alignment, 
to find the best sequence of mappings between 
morphs and morphemic labels. Each possible pair 
of morph/morphemic label has a distance associated 
with it. For each segmented word, the algorithm 
searches for the alignment that minimizes the to- 
tal alignment distance for the word. The distance 
d{M, L) for a pair of morph M and label L is given 
by: 

d(M,L) = - log (2) 

CM 

where cm,l is the number of word tokens in which 
the morph M has been aligned with the label L; and 
Cm is the number of word tokens that contain the 
morph M in their segmentation. The distance mea- 
sure can be thought of as the negative logarithm of a 
conditional probability P{L\M). This indicates the 
probability that a morph M is a realisation of a mor- 
pheme represented by the label L. Put another way, 
if the unsupervised segmentation algorithm discov- 
ers morphs that are allomorphs of real morphemes, a 
particular allomorph will ideally always be aligned 
with the same (correct) morphemic label, which 
leads to a high probability P{L\M), and a short dis- 
tance d{M, In contrast, if the segmentation al- 
gorithm does not discover meaningful morphs, each 
of the segments will be aligned with a number of dif- 
ferent morphemic labels throughout the corpus, and 
as a consequence, the probabilities will be low and 
the distances high. 

We then utilize the EM algorithm for iteratively 
improving the alignment. The initial alignment that 
is used for computing initial distance values is ob- 
tained through a string matching procedure: String 
matching is efficient for aligning the stem of the 
word with the base form (e.g., the morph puu with 
the label PUU, and the morphs t + alo with the 
label TALO). The suffix morphs that do not match 
well with the base form labels will end up aligned 
somehow with the morphological tags (e.g., the 
morph ja with the labels PL + PTV). 



This holds especially for allomorphs of 'stem morphemes', 
e.g., it is possible to identify the English morpheme easy with 
a probability of one from both its allomorphs: easy and easi. 
However, suffixes, in particular, can have several meanings, 
e.g., the English suffix can mean either the plural of nouns 
or the third person singular of the present tense of verbs. 



Comparison of methods. In order to compare two 
segmentation algorithms, the segmentation of each 
is aligned with the linguistic morpheme labels, and 
the total distance of the alignment is computed. 
Shorter total distance indicates better segmentation. 

However, one should note that the distance mea- 
sure used favors long morphs. If a particular "seg- 
mentation" algorithm does not split one single word 
of the corpus, the total distance can be zero. In such 
a situation, the single morph that a word is com- 
posed of is aligned with all morphemic labels of the 
word. The morph M, i.e., the word, is unique, which 
means that all probabilities P{L\M) are equal to 
one: e.g., the morph puutalo ja is always aligned 
with the labels PUU + TALO -i- PL + PTV and no 
other labels, which yields the probabilities P(PUU | 
puutalo j a) = P(TALO | puutalo j a) = P(PL | 
puutalo j a) = P(PTV | puutalo j a) = 1. 

Therefore, part of the corpus should be used as 
training data, and the rest as test data. Both data sets 
are segmented using the unsupervised segmentation 
algorithms. The training set is then used for estimat- 
ing the distance values d{M,L). These values are 
used when the test set is aligned. The better seg- 
mentation algorithm is the one that yields a better 
alignment distance for the test set. 

For moiph/label pairs that were never observed in 
the training set, a maximum distance value is as- 
signed. A good segmentation algorithm will find 
segments that are good building blocks of entirely 
new word forms, and thus the maximum distance 
values will occur only rarely. 

5 Experiments and Results 

We compared the two proposed methods as well as 
Goldsmith's program Linguistics^ on both Finnish 
and English corpora. The Finnish corpus consisted 
of newspaper text from CSC|. A morphosyntac- 
tic analysis of the text was performed using the 
Conexor FDG parserQ All characters were con- 
verted to lower case, and words containing other 
characters than a through z and the Scandinavian 
letters d, d and o were removed. Other than mor- 
phemic tags were removed from the morphological 

'http://humanities.uchicago.edu/faculty/goldsmith/Linguist- 
ica2000/ 

*http://www.csc.fi/kielipankki/ 
'http://www.conexor.fi/ 



analyses of the words. The remaining tags corre- 
spond to inflectional affixes (i.e. endings and mark- 
ers) and clitics. Unfortunately the parser does not 
distinguish derivational affixes. The first 100 000 
word tokens were used as training data, and the fol- 
lowing 100 000 word tokens were used as test data. 
The test data contained 34 821 word types. 

The English corpus consisted of mainly newspa- 
per text from the Brown corpus]^. A morphologi- 
cal analysis of the words was performed using the 
Lingsoft ENGTWOL analyzer^ In case of multi- 
ple alternative morphological analyses, the shortest 
analysis was selected. All characters were converted 
to lower case, and words containing other characters 
than a through z, an apostrophe or a hyphen were 
removed. Other than morphemic tags were removed 
from the morphological analyses of the words. The 
remaining tags correspond to inflectional or deriva- 
tional affixes. A set of 100 000 word tokens from the 
corpus sections Press Reportage and Press Editorial 
were used as training data. A separate set of 100 000 
word tokens from the sections Press Editorial, Press 
Reviews, Religion, and Skills Hobbies were used as 
test data. The test data contained 12 053 word types. 

Test results for the three methods and the two lan- 
guages are shown in Table ^ We observe different 
tendencies for Finnish and English. For Finnish, 
there is a correlation between the compression of 
the corpus and the linguistic generalization capac- 
ity to new word forms. The Recursive splitting with 
the MDL cost function is clearly superior to the Se- 
quential splitting with ML cost, which in turn is su- 
perior to Linguistica. The Recursive MDL method 
is best in terms of data compression: it produces the 
smallest morph lexicon (codebook), and the code- 
book naturally occupies a small part of the total cost. 
It is best also in terms of the linguistic measure, the 
total alignment distance on test data. Linguistica, on 
the other hand, employs a more restricted segmenta- 
tion, which leads to a larger codebook and to the fact 
that the codebook occupies a large part of the total 
MDL cost. This also appears to lead to a poor gen- 
eralization ability to new word forms. The linguis- 
tic alignment distance is the highest, and so is the 
percentage of aligned morph/morphemic label pairs 



The Brown corpus is available at the Linguistic Data Con- 



sortium at 
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http://www.ldc.upenn.edu/ 



http://www.lingsoft.fi/ 



that were never observed in the training set. On the 
other hand, Linguistica is the fastest program]^ 

Also for English, the Recursive MDL method 
achieves the best alignment, but here Linguistica 
achieves nearly the same result. The rate of com- 
pression follows the same pattern as for Finnish, 
in that Linguistica produces a much larger morph 
lexicon than the methods presented in this pa- 
per. In spite of this fact, the percentage of unseen 
morph/morphemic label pairs is about the same for 
all three methods. This suggests that in a morpho- 
logically poor language such as English a restrictive 
segmentation method, such as Linguistica, can com- 
pensate for new word forms - that it does not rec- 
ognize at all - with old, familiar words, that it "gets 
just right". In contrast, the methods presented in this 
paper produce a morph lexicon that is smaller and 
able to generalize better to new word forms but has 
somewhat lower accuracy for already observed word 
forms. 

Visual inspection of a sample of words. In an 

attempt to analyze the segmentations more thor- 
oughly, we randomly picked 1000 different words 
from the Finnish test set. The total number of occur- 
rences of these words constitute about 2.5% of the 
whole set. We inspected the segmentation of each 
word visually and classified it into one of three cat- 
egories: (1) correct and complete segmentation (i.e., 
all relevant morpheme boundaries were identified), 
(2) correct but incomplete segmentation (i.e., not all 
relevant morpheme boundaries were identified, but 
no proposed boundary was incorrect), (3) incorrect 
segmentation (i.e., some proposed boundary did not 
correspond to an actual morpheme boundary). 

The results of the inspection for each of the three 
segmentation methods are shown in Table ^ The 
Recursive MDL method performs best and segments 
about half of the words correctly. The Sequential 
ML method comes second and Linguistica third with 
a share of 43% correctly segmented words. When 
considering the incomplete and incorrect segmenta- 
tions the methods behave differently. The Recursive 
MDL method leaves very common word forms un- 
split, and often produces excessive splitting for rare 

'"Note, however, that the computing time comparison with 
Linguistica is only approximate since it was a compiled pro- 
gram run on Windows whereas the two other methods were im- 
plemented as Perl scripts run on Linux. 



Table 2: Test results for the Finnish and English corpus. Method names are abbreviated: Recursive seg- 
mentation and MDL cost (Rec. MDL), Sequential segmentation and ML cost (Seq. ML), and Linguistica 
(Ling.). The total MDL cost measures the compression of the corpus. However, the cost is computed accord- 
ing to Equation ([I]), which favors the Recursive MDL method. The final number of morphs in the codebook 
(#morphs in codebook) is a measure of the size of the morph "vocabulary". The relative codebook cost 
gives the share of the total MDL cost that goes into coding the codebook. The alignment distance is the total 
distance computed over the sequence of morph/morphemic label pairs in the test data. The unseen aligned 
pairs is the percentage of all aligned morph/label pairs in the test set that were never observed in the training 
set. This gives an indication of the generalization capacity of the method to new word forms. 



Language 


Finnish 


English 


Method 


Rec. MDL 


Seq. ML 


Ling. 


Rec. MDL 


Seq. ML 


Ling. 


Total MDL cost [bits] 


2.09M 


2.27M 


2.88M 


1.26M 


1.34M 


1.44M 


#morphs in codebook 


6302 


10 977 


22 075 


3836 


4888 


8153 


Relative codebook cost 


10.16% 


15.27% 


36.81% 


9.42% 


10.90% 


19.14% 


Alignment distance 


768k 


817k 


link 


313k 


444k 


332k 


Unseen aligned pairs 


23.64% 


20.20% 


37.22% 


18.75% 


19.67% 


20.94% 


Time [sec] 


620 


390 


180 


130 


80 


30 



Table 3: Estimate of accuracy of morpheme bound- 
ary detection based on visual inspection of a sample 
of 2500 Finnish word tokens. 



Method 


Correct 


Incomplete 


Incorrect 


Rec. MDL 


49.6% 


29.7% 


20.6% 


Seq. ML 


47.3% 


15.3% 


37.4% 


Linguistica 


43.1% 


24.1% 


32.8% 



words. The Sequential ML method is more prone to 
excessive splitting, even for words that are not rare. 
Linguistica, on the other hand, employs a more con- 
servative splitting strategy, but makes incorrect seg- 
mentations for many common word forms. 

The behaviour of the methods is illustrated by ex- 
ample segmentations in Table Often the Recur- 
sive MDL method produces complete and correct 
segmentations. However, both it and the Sequential 
ML method can produce excessive splitting, as is 
shown for the latter, e.g. affecti + on + at 
+ e. In contrast, Linguistica refrains from splitting 
words when they should be split, e.g., the Finnish 
compound words in the table. 

6 Discussion of the Model 

Regarding the model, there is always room for im- 
provement. In particular, the current model does 



not allow representation of contextual dependencies, 
i.e., that some morphs appear only in particular con- 
texts (allomorphy). Moreover, languages have rules 
regarding the ordering of stems and affixes (morpho- 
tax). However, the current model has no way of rep- 
resenting such contextual dependencies. 

7 Conclusions 

In the experiments the online method with the MDL 
cost function and recursive splitting appeared most 
successful especially for Finnish, whereas for En- 
glish the compared methods were rather equal in 
performance. This is likely to be partially due to 
the model structure of the presented methods which 
is especially suitable for languages such as Finnish. 
However, there is still room for considerable im- 
provement in the model structure, especially regard- 
ing the representation of contextual dependencies. 

Considering the two examined model optimiza- 
tion methods, the Recursive MDL method per- 
formed consistently somewhat better. Whether this 
is due to the cost function or the splitting strategy 
cannot be deduced based on these experiments. In 
the future, we intend to extend the latter method to 
utilize an MDL-like cost function. 



Table 4: Some English and Finnish word segmentations produced by the three methods. The Finnish words 
are elainlaakari (veterinarian, lit. animal doctor), elainmuseo (zoological museum, Ut. animal 
museum), elainpuisto (zoological park, lit. animal park), and elaintarha (zoo, lit. animal garden). 
The suffixes -lie, -n, -on, and -sta are linguistically correct. (Note that in the Sequential ML method 
the rejection criteria mentioned are not applied on the last round of Viterbi segmentation. This is why two 
one letter morphs appear in a sequence in the segmentation elain + tarh + a + n.) 



Recursive MDL 


Sequential ML 


Linguistica 


affect 

affect + ing 
affect + ing + ly 
affect + ion 
affect + ion + ate 
affect + ion + s 
affect + s 


affect 

affect + ing 
affect + ing + ly 
affecti + on 
affecti + on + at + e 
affecti + on + s 
affect + s 


affect 

affect + ing 
affect + ing + ly 
affect + ion 
affect + ion + ate 
affect + ion + s 
affect + s 


elain + laakari 
elain + laakari + He 
elain + museo + n 
elain + museo + on 
elain + puisto + n 
elain + puisto + sta 
elain + tar + han 


elain + laakari 
elain + laakari + He 
elain + museo + n 
elain + museo + on 
elain + puisto + n 
elain + puisto + sta 
eliiin + tarh + a + n 


elainlaakari 
elainlaakari + He 
elainmuseo + n 

elainmuseo + on 
elainpuisto + n 
elainpuisto + sta 
elaintarh + an 
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